Adding the option to disable the DNS processor failure or success cache#44932

Merged
andrewkroh merged 17 commits into elastic:main from mjmbischoff:dns-cache
Jun 26, 2025

Conversation

@mjmbischoff
Contributor

@mjmbischoff mjmbischoff commented Jun 19, 2025

Proposed commit message

Adds the option to disable the success and failure cache.

Motivation

This enables use cases that need the current, point-in-time DNS record regardless of any cached value or the record's TTL, such as monitoring a DNS server or recording events that must capture the current state of the environment. The TTL bounds the window during which a cached value may be served instead of the current DNS record; within that window the agent may observe either the old or the new record, depending on when the previous request was made. This unpredictability can be undesirable when optimizing time-to-intervention.

Disabling the cache has throughput implications: processing a single event serially takes at least one DNS round trip. For example, if a DNS request takes 1 ms, maximum serial throughput is limited to 1000 events/sec. Known use cases have low throughput requirements. Parallelization, for example by deploying multiple agents, can raise this limit. If that becomes necessary, we would urge reevaluating the use case and whether the cache should be disabled at all.
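
To make the arithmetic above concrete, here is a minimal Go sketch of the throughput bound. The function names and `exampleRTT` are hypothetical and are not part of libbeat. The parallel estimate assumes independent workers and a resolver that is not itself the bottleneck.

```go
package main

import (
	"fmt"
	"time"
)

// exampleRTT is the illustrative DNS round-trip time used in the text above.
const exampleRTT = time.Millisecond

// maxSerialThroughput returns the upper bound on events per second when every
// event blocks on one uncached DNS lookup with the given round-trip time.
func maxSerialThroughput(rtt time.Duration) float64 {
	if rtt <= 0 {
		return 0 // no meaningful bound; treated as "no estimate" in this sketch
	}
	return float64(time.Second) / float64(rtt)
}

// maxParallelThroughput scales the serial bound across independent workers
// (for example, separate agents). This is an idealized estimate.
func maxParallelThroughput(rtt time.Duration, workers int) float64 {
	return maxSerialThroughput(rtt) * float64(workers)
}

func main() {
	fmt.Printf("serial:   %.0f events/sec\n", maxSerialThroughput(exampleRTT))
	fmt.Printf("4 agents: %.0f events/sec\n", maxParallelThroughput(exampleRTT, 4))
}
```

Real throughput will be lower than these bounds, since each event also incurs processing time beyond the lookup itself.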

NOTE: Setting the TTL on the failure cache to 1ns achieves a similar, but imperfect, effect.
NOTE: Setting a TTL on the success cache is accepted by the code, but the value is ignored, as is also documented in the code; the documentation omits it as an option. Honoring the setting together with the record's TTL (min(ttl, dns_record_ttl)) would be a different route, similar to the behavior of other DNS clients.
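
For illustration only, the min(ttl, dns_record_ttl) route mentioned in the second note could look like the sketch below. This is not what the processor does today (per the note, the success cache TTL setting is ignored). `effectiveTTL` and the example constants are hypothetical names, and treating a non-positive configured value as "no cap" is an assumption of this sketch.

```go
package main

import (
	"fmt"
	"time"
)

// Illustrative values only; neither comes from the PR.
const (
	exampleConfiguredTTL = 30 * time.Second
	exampleRecordTTL     = 5 * time.Minute
)

// effectiveTTL returns how long a successful answer would be cached if a
// configured TTL were honored as an upper bound on the record's own TTL.
// A non-positive configured value means "no cap" in this sketch.
func effectiveTTL(configured, record time.Duration) time.Duration {
	if configured <= 0 {
		return record
	}
	if configured < record {
		return configured
	}
	return record
}

func main() {
	fmt.Println("capped:  ", effectiveTTL(exampleConfiguredTTL, exampleRecordTTL))
	fmt.Println("uncapped:", effectiveTTL(0, exampleRecordTTL))
}
```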

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Disruptive User Impact

None known. The default values leave the old behavior intact, and the setting that triggers the new behavior is added in this PR.

How to test this PR locally

Define the DNS processor, observe cache stats / resolver requests.
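
As a rough model of what to expect when observing resolver requests, the following standalone Go sketch counts how many lookups reach a fake resolver with a cache enabled versus disabled. It does not use the libbeat DNS processor; `lookuper`, `countRequests`, and the hard-coded address are hypothetical, and TTLs and failure caching are left out to keep it short.

```go
package main

import "fmt"

// resolver is a hypothetical stand-in for an upstream DNS lookup.
type resolver func(name string) (string, error)

// lookuper optionally caches successful answers. It ignores TTLs and failure
// caching entirely; it only models "cache on" versus "cache off".
type lookuper struct {
	resolve      resolver
	cacheEnabled bool
	cache        map[string]string
}

func newLookuper(r resolver, cacheEnabled bool) *lookuper {
	return &lookuper{resolve: r, cacheEnabled: cacheEnabled, cache: map[string]string{}}
}

// lookup consults the cache only when it is enabled; otherwise every call
// goes to the resolver.
func (l *lookuper) lookup(name string) (string, error) {
	if l.cacheEnabled {
		if v, ok := l.cache[name]; ok {
			return v, nil
		}
	}
	v, err := l.resolve(name)
	if err != nil {
		return "", err
	}
	if l.cacheEnabled {
		l.cache[name] = v
	}
	return v, nil
}

// countRequests performs n lookups of the same name and reports how many of
// them reached the resolver.
func countRequests(cacheEnabled bool, n int) int {
	requests := 0
	l := newLookuper(func(name string) (string, error) {
		requests++
		return "192.0.2.1", nil // documentation-range address, not a real answer
	}, cacheEnabled)
	for i := 0; i < n; i++ {
		if _, err := l.lookup("example.test"); err != nil {
			return -1
		}
	}
	return requests
}

func main() {
	fmt.Println("cache enabled: ", countRequests(true, 5))
	fmt.Println("cache disabled:", countRequests(false, 5))
}
```

With the cache disabled, the number of resolver requests should track the number of events one-to-one, which is the behavior to look for when testing the real processor.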

Related issues

@mjmbischoff mjmbischoff requested a review from a team as a code owner June 19, 2025 14:20
@botelastic botelastic Bot added the needs_team Indicates that the issue/PR needs a Team:* label label Jun 19, 2025
@github-actions
Contributor

🤖 GitHub comments

Expand to view the GitHub comments

Just comment with:

  • run docs-build : Re-trigger the docs validation. (use unformatted text in the comment!)

@mergify
Contributor

mergify Bot commented Jun 19, 2025

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @mjmbischoff? 🙏.
For such, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-8./d is the label to automatically backport to the 8./d branch. /d is the digit
  • backport-active-all is the label that automatically backports to all active branches.
  • backport-active-8 is the label that automatically backports to all active minor branches for the 8 major.
  • backport-active-9 is the label that automatically backports to all active minor branches for the 9 major.

@mjmbischoff mjmbischoff changed the title Adding the option to disable the DNS failure or success cache Adding the option to disable the DNS processor failure or success cache Jun 19, 2025
- QF1008, while I disagree with removing the additional qualification as it makes things more readable, removing the qualifier to appease the linter god.
Member

@andrewkroh andrewkroh left a comment

Can you please add to the proposed commit message to explain the why part and what is the use case for cache disablement.

- WHY: the rationale/motivation for the changes

Turning off the caching will significantly limit the throughput of the pipeline. Even if each request takes 1ms to complete, that means the maximum throughput is 1000 EPS.

Also, the documentation for the processor will need to be updated to include the new configuration parameter.

Comment thread libbeat/processors/dns/config.go Outdated
@mjmbischoff
Contributor Author

mjmbischoff commented Jun 21, 2025

Can you please add to the proposed commit message to explain the why part and what is the use case for cache disablement.

- WHY: the rationale/motivation for the changes

Turning off the caching will significantly limit the throughput of the pipeline. Even if each request takes 1ms to complete, that means the maximum throughput is 1000 EPS.

Also, the documentation for the processor will need to be updated to include the new configuration parameter.

Added motivation.

Added documentation: c5de66a, ab103b2, e1f60c9

- document Enabled settings
- Notes with warnings on throughput and compounding effects
@mjmbischoff mjmbischoff requested a review from a team as a code owner June 21, 2025 09:58
- document Enabled settings
- Notes with warnings on throughput and compounding effects
Contributor

@colleenmcginnis colleenmcginnis left a comment

Some minor suggestions below.

Comment thread docs/reference/auditbeat/processor-dns.md Outdated
Comment thread docs/reference/auditbeat/processor-dns.md Outdated
Comment thread docs/reference/auditbeat/processor-dns.md Outdated
Comment thread docs/reference/auditbeat/processor-dns.md Outdated
mjmbischoff and others added 2 commits June 24, 2025 08:52
Co-authored-by: Colleen McGinnis <colleen.j.mcginnis@gmail.com>
Comment thread libbeat/processors/dns/config.go Outdated
Comment thread libbeat/processors/dns/cache.go Outdated
Comment thread libbeat/processors/dns/dns_test.go Outdated
@mjmbischoff mjmbischoff removed enhancement needs_team Indicates that the issue/PR needs a Team:* label labels Jun 25, 2025
@mjmbischoff mjmbischoff added backport-active-9 Automated backport with mergify to all the active 9.[0-9]+ branches and removed enhancement needs_team Indicates that the issue/PR needs a Team:* label labels Jun 25, 2025
@botelastic botelastic Bot added the needs_team Indicates that the issue/PR needs a Team:* label label Jun 25, 2025
@andrewkroh andrewkroh merged commit eee15e7 into elastic:main Jun 26, 2025
203 checks passed
@github-actions
Contributor

@Mergifyio backport 9.0 9.1

@mergify
Contributor

mergify Bot commented Jun 26, 2025

backport 9.0 9.1

✅ Backports have been created

Details

mergify Bot pushed a commit that referenced this pull request Jun 26, 2025
This enables use cases that require resolving the current DNS record,
regardless of the record's TTL or any previously cached values. It is
useful, for example, when monitoring a DNS server or when recorded
events must capture the environment's state at a specific moment.

When a cache is used, the TTL determines the time frame in which an
agent might observe a stale record instead of the current one. This
unpredictability can be undesirable when optimizing for rapid
time-to-intervention.

Disabling the cache has significant throughput implications. The
processing time for a single event will be at least the DNS round-trip
time. For example, if a DNS request takes 1 ms, the maximum serial
throughput is limited to 1000 events/sec. Known use cases for this
feature have low throughput requirements. Throughput can be increased
by deploying multiple, parallel agents.

NOTE: Setting the failure cache TTL to a very low value (e.g., 1ns)
achieves a similar, but imperfect, effect.

NOTE: While the config allows setting a TTL on the success cache, this
option is currently ignored. A future enhancement could honor this
setting (e.g., by using min(configured_ttl, record_ttl)), which would
align with the behavior of other DNS clients.

(cherry picked from commit eee15e7)

# Conflicts:
#	libbeat/processors/dns/dns_test.go
mergify Bot pushed a commit that referenced this pull request Jun 26, 2025
(cherry picked from commit eee15e7)
@mjmbischoff mjmbischoff deleted the dns-cache branch June 26, 2025 21:24
andrewkroh pushed a commit that referenced this pull request Jun 27, 2025
… (#45078)

(cherry picked from commit eee15e7)

Co-authored-by: Michael Bischoff <mjmbischoff@controplex.com>
Co-authored-by: Visha Angelova <91186315+vishaangelova@users.noreply.github.com>
Labels

  • backport-active-9: Automated backport with mergify to all the active 9.[0-9]+ branches
  • enhancement
  • libbeat
  • needs_team: Indicates that the issue/PR needs a Team:* label
  • :Processors

3 participants